Feature: Incremental Append Scan by smaheshwar-pltr · Pull Request #3364 · apache/iceberg-python

smaheshwar-pltr · 2026-05-15T15:47:53Z

Closes #2634.

Rationale for this change

Largely a revival of #2634 (comment). Please see that issue and previous PRs for context and motivation.

References: https://github.com/apache/iceberg (containing Iceberg-Java and Spark, both are relevant to us), and apache/iceberg-cpp#590. Note: I've asked an LLM to drop review comments on this PR linking to relevant places in the references mentioned, to aid reviewing.

Are these changes tested?

Yes, both unit and integration tests can be found in this PR.

Are there any user-facing changes?

Yes: some private methods are removed, but there are no public changes apart from the new feature. Please see the PR comments for more information.

smaheshwar-pltr · 2026-05-17T16:26:08Z

+class BaseScan(ABC):
+    """A base class for all table scans."""

-class TableScan(ABC):


This isn't a rename or removal (the diff is misleading) - TableScan is just moved below the new BaseScan class

smaheshwar-pltr · 2026-05-17T16:26:45Z

-            and (manifest.sequence_number or INITIAL_SEQUENCE_NUMBER) >= min_sequence_number
-        )
+    @property
+    def partition_filters(self) -> KeyDefaultDict[int, BooleanExpression]:


This is a public property, so keeping it for back-compat

smaheshwar-pltr · 2026-05-17T16:27:52Z



 class DataScan(TableScan):
-    def _build_partition_projection(self, spec_id: int) -> BooleanExpression:


This is now moved into ManifestGroupPlanner so it can be shared with DataScan and IncrementalAppendScan.

smaheshwar-pltr · 2026-05-17T16:28:16Z

+    def plan_files(
+        self,
+        manifests: Iterable[ManifestFile],
+        manifest_entry_filter: Callable[[ManifestEntry], bool] = lambda _: True,


This manifest entry filter is new. It is introduced for the append scan logic, where some manifest entries are skipped.
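To illustrate the shape of this parameter, here is a minimal sketch rather than the PR's code. `Entry` and `plan_entries` are hypothetical stand-ins, not PyIceberg names. The default predicate accepts everything, so existing callers see no change, while an append scan can pass something narrower.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a manifest entry (not PyIceberg's ManifestEntry)."""

    snapshot_id: int
    status: str  # "ADDED", "EXISTING" or "DELETED"


def plan_entries(
    entries: Iterable[Entry],
    entry_filter: Callable[[Entry], bool] = lambda _: True,
) -> Iterator[Entry]:
    """Yield only the entries accepted by the filter; the default accepts all."""
    return (entry for entry in entries if entry_filter(entry))


entries = [Entry(1, "ADDED"), Entry(2, "EXISTING"), Entry(3, "ADDED")]

# Default behaviour: nothing is skipped.
all_entries = list(plan_entries(entries))

# Append-scan style: keep only entries added by snapshots in a given range.
in_range = {3}
appended = list(
    plan_entries(entries, lambda e: e.status == "ADDED" and e.snapshot_id in in_range)
)
```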

smaheshwar-pltr · 2026-05-18T15:19:50Z

            table_identifier=self._identifier,
        )

+    def incremental_append_scan(


New convenience method mirroring Table.scan (naming thought). Args mirror scan minus snapshot_id plus the two snapshot-range args.

smaheshwar-pltr · 2026-05-18T15:20:06Z

+                to return in the output dataframe.
+            case_sensitive:
+                If True column matching is case sensitive.
+            from_snapshot_id_exclusive:


Requiring from_snapshot_id_exclusive to be non-None at plan time is a deliberate divergence from Java's IncrementalScan semantics (where the start defaults to the oldest ancestor of the end snapshot when not configured). Follows Spark's required start-snapshot-id (docs). Argument here — TL;DR an append scan only reads append snapshots, so "from the oldest ancestor" would be misleading after a replace.

smaheshwar-pltr · 2026-05-18T15:20:17Z

    ) -> DataScan:
        raise ValueError("Cannot scan a staged table")

+    def incremental_append_scan(


Mirrors StagedTable.scan two lines up — staged tables have no committed metadata to scan against.

smaheshwar-pltr · 2026-05-18T15:20:33Z

+A = TypeVar("A", bound="BaseScan", covariant=True)
+

+class BaseScan(ABC):


BaseScan is new; TableScan is unchanged in surface but now subclasses it. Why split:

This PR keeps snapshot_id, catalog, table_identifier, use_ref, snapshot(), and abstract count() on TableScan to avoid the breaking change #533 introduced when it dropped these.

That makes TableScan snapshot-specific, so it isn't a sensible base class for incremental scans (which have two snapshot IDs, not one).

BaseScan therefore holds the genuinely-shared surface (row filter, projection, options, limit, chaining helpers, format-converter sinks built on to_arrow()).

I don't love this — if breaking TableScan were acceptable we could collapse the hierarchy like #533. See prior thinking and follow-up.

Also pointing out: I could've avoided changing existing code entirely by having a completely independent class for append scans with duplicated manifest planning logic. I felt as though:

the hierarchy with TableScan and DataScan (prior to this PR) would then feel odd with a fully independent IncrementalAppendScan

duplicated code is a code smell, so I've gone with a refactor here. Note that it's largely just moving code around rather than anything else! Let me know what folks think, or any design suggestions here; I'm very open to changes

(I realise this makes the diff here scary 😄 )
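For anyone skimming, the split described above can be sketched roughly as follows. This is a simplified illustration under assumptions, not the PR's classes: the fields are reduced to a handful, `describe` is a hypothetical abstract method standing in for the real abstract API, and `to_snapshot_id_inclusive` is an assumed name for the second range argument.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, replace
from typing import Any, Optional, TypeVar

A = TypeVar("A", bound="BaseScan")


@dataclass(frozen=True)
class BaseScan(ABC):
    """Shared surface only: filter, case sensitivity, limit, chaining helpers."""

    row_filter: str = "true"
    case_sensitive: bool = True
    limit: Optional[int] = None

    def update(self: A, **overrides: Any) -> A:
        # Return a modified copy so that scans can be chained without mutation.
        return replace(self, **overrides)

    def with_case_sensitive(self: A, case_sensitive: bool = True) -> A:
        return self.update(case_sensitive=case_sensitive)

    @abstractmethod
    def describe(self) -> str: ...  # hypothetical abstract method


@dataclass(frozen=True)
class TableScan(BaseScan):
    """Snapshot-specific: exactly one snapshot ID."""

    snapshot_id: Optional[int] = None

    def describe(self) -> str:
        return f"snapshot {self.snapshot_id}"


@dataclass(frozen=True)
class IncrementalAppendScan(BaseScan):
    """Range-specific: two snapshot IDs, so it does not extend TableScan."""

    from_snapshot_id_exclusive: Optional[int] = None
    to_snapshot_id_inclusive: Optional[int] = None  # assumed name

    def describe(self) -> str:
        return f"({self.from_snapshot_id_exclusive}, {self.to_snapshot_id_inclusive}]"


scan = IncrementalAppendScan(from_snapshot_id_exclusive=1, to_snapshot_id_inclusive=3)
chained = scan.with_case_sensitive(False)
```

The chaining helper keeps the concrete type, so `chained` is still an `IncrementalAppendScan` and never acquires a `snapshot_id` attribute.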

smaheshwar-pltr · 2026-05-18T15:20:46Z

+    @abstractmethod
+    def plan_files(self) -> Iterable[ScanTask]: ...
+
+    def to_arrow(self) -> pa.Table:


Materialization stays abstract on BaseScan. Both DataScan and IncrementalAppendScan implement to_arrow / to_arrow_batch_reader as one-line delegations to the module-level helpers _to_arrow_via_file_scan_tasks / _to_arrow_batch_reader_via_file_scan_tasks above.

A BaseScan-level default would require Iterable[FileScanTask], but BaseScan.plan_files() returns Iterable[ScanTask] — Liskov-widened so that future non-file scans (e.g. changelog) can return a different task type. Mypy arg-type makes the default-on-base form impossible without specialising the base. Helpers keep the dedup without that constraint.

to_pandas / to_polars / to_duckdb / to_ray do get pulled up to BaseScan as defaults — they only need to_arrow() on self, no FileScanTask typing. Prior thinking.
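A stripped-down sketch of that arrangement, with assumptions called out: a plain list of dicts stands in for `pa.Table`, `to_row_count` is a hypothetical converter standing in for `to_pandas` and friends, and the helper body is illustrative rather than the PR's implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

Row = Dict[str, int]


class ScanTask:
    """Generic task type; future scans may plan something other than files."""


class FileScanTask(ScanTask):
    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows


class BaseScan(ABC):
    @abstractmethod
    def plan_files(self) -> Iterable[ScanTask]: ...

    @abstractmethod
    def to_arrow(self) -> List[Row]: ...  # a list of rows stands in for pa.Table

    def to_row_count(self) -> int:
        # Converter-style default (hypothetical): needs only to_arrow() on self.
        return len(self.to_arrow())


def _to_arrow_via_file_scan_tasks(scan: BaseScan, tasks: Iterable[FileScanTask]) -> List[Row]:
    # Module-level helper: it needs FileScanTask specifically, so it cannot be
    # a default on BaseScan, whose plan_files() only promises ScanTask.
    return [row for task in tasks for row in task.rows]


class DataScan(BaseScan):
    def plan_files(self) -> Iterable[FileScanTask]:
        return [FileScanTask([{"n": 1}]), FileScanTask([{"n": 2}, {"n": 3}])]

    def to_arrow(self) -> List[Row]:
        # One-line delegation, as described above.
        return _to_arrow_via_file_scan_tasks(self, self.plan_files())


rows = DataScan().to_arrow()
count = DataScan().to_row_count()
```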

smaheshwar-pltr · 2026-05-18T15:21:03Z

+    def with_case_sensitive(self: A, case_sensitive: bool = True) -> A:
+        return self.update(case_sensitive=case_sensitive)
+
+    def to_pandas(self, **kwargs: Any) -> pd.DataFrame:


to_pandas / to_polars were previously abstract on TableScan. They now have default implementations on BaseScan (built on to_arrow()). Prior thinking.

smaheshwar-pltr · 2026-05-18T15:21:04Z

+        """
+        return self.to_arrow().to_pandas(**kwargs)
+
+    def to_duckdb(self, table_name: str, connection: DuckDBPyConnection | None = None) -> DuckDBPyConnection:


to_duckdb and to_ray were previously only on DataScan, not even on TableScan. Pulling them up to BaseScan means TableScan and any external subclass now inherit them. Net additive. Prior thinking.

smaheshwar-pltr · 2026-05-18T15:21:06Z

+S = TypeVar("S", bound="TableScan", covariant=True)
+
+
+class TableScan(BaseScan, ABC):


Was a direct ABC; now extends BaseScan. All previously-present fields, methods, and abstract API are preserved (see #3364 (comment)). The only behavioural delta is that previously-abstract methods on TableScan (to_pandas, to_polars) now have default implementations inherited from BaseScan.

smaheshwar-pltr · 2026-05-18T15:21:25Z

    @cached_property
-    def partition_filters(self) -> KeyDefaultDict[int, BooleanExpression]:
-        return KeyDefaultDict(self._build_partition_projection)
+    def _manifest_planner(self) -> ManifestGroupPlanner:


Cached so that the planner's own partition_filters cached_property lives for the scan's lifetime — matches the pre-PR caching behaviour on DataScan (where partition_filters was itself a cached_property directly).
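The caching point can be shown with a small standalone sketch. `Planner`, `CachedScan` and `UncachedScan` are hypothetical classes; only `functools.cached_property` is real. If the planner were rebuilt on every access, its own cached property would be rebuilt with it.

```python
from functools import cached_property
from typing import Dict


class Planner:
    builds = 0  # counts how often the expensive structure is built

    @cached_property
    def partition_filters(self) -> Dict[int, str]:
        Planner.builds += 1
        return {}


class CachedScan:
    @cached_property
    def _manifest_planner(self) -> Planner:
        return Planner()  # created once per scan instance


class UncachedScan:
    @property
    def _manifest_planner(self) -> Planner:
        return Planner()  # a fresh planner on every access


cached = CachedScan()
cached._manifest_planner.partition_filters
cached._manifest_planner.partition_filters
builds_with_cached_planner = Planner.builds

Planner.builds = 0
uncached = UncachedScan()
uncached._manifest_planner.partition_filters
uncached._manifest_planner.partition_filters
builds_with_uncached_planner = Planner.builds
```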

smaheshwar-pltr · 2026-05-18T15:21:26Z

-        partition_type = spec.partition_type(self.table_metadata.schema())
-        partition_schema = Schema(*partition_type.fields)
-        partition_expr = self.partition_filters[spec_id]
+    def scan_plan_helper(self) -> Iterator[list[ManifestEntry]]:


Public; only call site within PyIceberg is pyiceberg/table/inspect.py. Kept for back-compat — external library users may rely on it. Body now delegates to ManifestGroupPlanner.plan_manifest_entries so the work isn't duplicated with IncrementalAppendScan. (Prior context on whether the underscore-prefixed helpers needed a deprecation cycle — they're gone now and aren't documented as supported.)

smaheshwar-pltr · 2026-05-18T15:21:27Z

-                which can be used to read a stream of record batches one by one.
-        """
-        import pyarrow as pa
+class IncrementalAppendScan(BaseScan):


Mirrors Java's IncrementalAppendScan interface and BaseIncrementalAppendScan implementation. Only the append variant of IncrementalScan — changelog scan is out of scope here.

smaheshwar-pltr · 2026-05-18T15:21:59Z


-    def to_pandas(self, **kwargs: Any) -> pd.DataFrame:
-        """Read a Pandas DataFrame eagerly from this Iceberg table.
+    def from_snapshot_exclusive(self: IAS, from_snapshot_id_exclusive: int | None) -> IAS:


Maps to Java's fromSnapshotExclusive(long). We don't expose the String ref overload or useBranch — Spark passes raw IDs anyway, and ref support can be added later without breaking anything.

smaheshwar-pltr · 2026-05-18T15:22:00Z


-        con = connection or duckdb.connect(database=":memory:")
-        con.register(table_name, self.to_arrow())
+    def projection(self) -> Schema:


Always uses the table's current schema, unlike TableScan.projection() which uses the snapshot's schema when snapshot_id is set. Matches Java: BaseTable.newIncrementalAppendScan constructs the scan with schema(), which on BaseTable.schema() returns ops.current().schema() — the table's current schema, not snapshot-bound. C++ does the same: TableScanBuilder::ResolveSnapshotSchema falls through to metadata_->Schema() for incremental scans (no snapshot_id on the context). Older-schema rows in range get NULL for new columns — covered by test_incremental_append_scan_schema_evolution_within_range.
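A toy illustration of the NULL-filling behaviour, under heavy simplification: rows are plain dicts and columns are matched by name, whereas Iceberg resolves columns by field ID. `project_to_schema` is a hypothetical helper, not part of PyIceberg.

```python
from typing import Any, Dict, List, Optional


def project_to_schema(
    rows: List[Dict[str, Any]], schema: List[str]
) -> List[Dict[str, Optional[Any]]]:
    """Project each row onto the given columns, using None for missing ones."""
    return [{column: row.get(column) for column in schema} for row in rows]


# "email" was added to the table after the first append in the range.
current_schema = ["id", "name", "email"]
rows_written_with_old_schema = [{"id": 1, "name": "a"}]
rows_written_with_new_schema = [{"id": 2, "name": "b", "email": "b@example.com"}]

result = project_to_schema(
    rows_written_with_old_schema + rows_written_with_new_schema, current_schema
)
```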

smaheshwar-pltr · 2026-05-18T15:22:02Z

+        return current_schema.select(*self.selected_fields, case_sensitive=self.case_sensitive)

-        return con
+    def plan_files(self) -> Iterable[FileScanTask]:


Mirrors Java's BaseIncrementalAppendScan.doPlanFiles and appendFilesFromSnapshots — walk ancestors, filter to append snapshots, dedup manifests whose added_snapshot_id is in range, then filter manifest entries by (snapshot_id in range, status == ADDED). Set semantics on the manifest dedup match the Java snippet and rely on ManifestFile.__eq__/__hash__ being defined (which they are on main since #2233).
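The steps above can be sketched on toy data as follows. All types here (`Entry`, `Manifest`, `Snapshot`) and the function `plan_appended_files` are hypothetical simplifications, not PyIceberg or Java classes, and the sketch assumes every snapshot in the walked range is still present in metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    path: str
    snapshot_id: int
    status: str  # "ADDED", "EXISTING" or "DELETED"


@dataclass(frozen=True)
class Manifest:
    added_snapshot_id: int
    entries: Tuple[Entry, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    operation: str
    manifests: Tuple[Manifest, ...]


def plan_appended_files(
    snapshots: Dict[int, Snapshot], from_exclusive: int, to_inclusive: int
) -> List[str]:
    # 1. Walk ancestors of `to` back to (excluding) `from`, keeping appends only.
    append_snapshots = []
    current: Optional[int] = to_inclusive
    while current is not None and current != from_exclusive:
        snapshot = snapshots[current]
        if snapshot.operation == "append":
            append_snapshots.append(snapshot)
        current = snapshot.parent_id
    append_ids = {s.snapshot_id for s in append_snapshots}
    # 2. Dedup manifests (set semantics) that were added by an in-range append.
    manifests = {
        m for s in append_snapshots for m in s.manifests if m.added_snapshot_id in append_ids
    }
    # 3. Keep entries that were ADDED by an in-range append snapshot.
    return sorted(
        e.path
        for m in manifests
        for e in m.entries
        if e.snapshot_id in append_ids and e.status == "ADDED"
    )


m1 = Manifest(1, (Entry("a.parquet", 1, "ADDED"),))
m2 = Manifest(2, (Entry("b.parquet", 2, "ADDED"),))
m3 = Manifest(3, (Entry("c.parquet", 3, "ADDED"),))  # written by a replace
m4 = Manifest(4, (Entry("d.parquet", 4, "ADDED"), Entry("a.parquet", 1, "EXISTING")))

snapshots = {
    1: Snapshot(1, None, "append", (m1,)),
    2: Snapshot(2, 1, "append", (m1, m2)),
    3: Snapshot(3, 2, "replace", (m2, m3)),
    4: Snapshot(4, 3, "append", (m2, m3, m4)),
}

planned = plan_appended_files(snapshots, from_exclusive=1, to_inclusive=4)
```

In this toy lineage the replace snapshot contributes nothing, `m2` is seen twice but planned once, and the carried-over `EXISTING` entry in `m4` is dropped.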

smaheshwar-pltr · 2026-05-18T15:22:22Z


-    def to_polars(self) -> pl.DataFrame:
-        """Read a Polars DataFrame from this Iceberg table.
+    def _validate_and_resolve_snapshots(self) -> tuple[int, int]:


Two semantic notes:

from (exclusive) is validated via is_parent_ancestor_of, not is_ancestor_of — matches Java's BaseIncrementalScan.fromSnapshotIdExclusive (see the inline comment there about expiry) and C++'s internal::FromSnapshotIdExclusive. This admits cursors whose from snapshot has since been expired (canonical incremental-ingestion pattern); fabricated IDs are still rejected.

Equal from/to raises (a snapshot is never its own parent ancestor), again matching Java/C++.
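Both notes can be checked on a toy lineage. The sketch below is an assumption-laden model, not the PR's code: snapshots are reduced to an id-to-parent mapping, `validate_range` and `is_rejected` are hypothetical names, and the error messages are illustrative.

```python
from typing import Dict, Optional, Tuple

# Snapshot id -> parent id. Snapshot 1 has been expired, so it is no longer a
# key, but snapshot 2 still records it as its parent.
PARENTS: Dict[int, Optional[int]] = {2: 1, 3: 2, 4: 3}


def is_parent_ancestor_of(
    snapshot_id: int, ancestor_parent_id: int, parents: Dict[int, Optional[int]]
) -> bool:
    """True if snapshot_id or any of its known ancestors has the given parent."""
    if snapshot_id not in parents:
        raise ValueError(f"Cannot find snapshot: {snapshot_id}")
    current: Optional[int] = snapshot_id
    while current is not None and current in parents:
        if parents[current] == ancestor_parent_id:
            return True
        current = parents[current]
    return False


def validate_range(
    from_exclusive: Optional[int], to_inclusive: int, parents: Dict[int, Optional[int]]
) -> Tuple[int, int]:
    if from_exclusive is None:
        raise ValueError("from_snapshot_id_exclusive must be set")
    if not is_parent_ancestor_of(to_inclusive, from_exclusive, parents):
        raise ValueError(f"{from_exclusive} is not a parent ancestor of {to_inclusive}")
    return from_exclusive, to_inclusive


def is_rejected(from_exclusive: Optional[int], to_inclusive: int) -> bool:
    try:
        validate_range(from_exclusive, to_inclusive, PARENTS)
    except ValueError:
        return True
    return False


expired_cursor = validate_range(1, 4, PARENTS)  # accepted although 1 is expired
```

With this model, an expired `from` is admitted because a surviving snapshot still names it as parent, while an equal `from`/`to`, a fabricated ID, and a missing `from` are all rejected.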

smaheshwar-pltr · 2026-05-18T15:22:23Z

+        return self.from_snapshot_id_exclusive, to_snapshot_id
+
+
+class ManifestGroupPlanner:


Motivated by Java's ManifestGroup — both DataScan and IncrementalAppendScan need to plan file scan tasks from a set of manifests with optional filtering, and this is the natural shape for that (prior thinking). All the _build_* helpers and _check_sequence_number are moved from DataScan, not new.

smaheshwar-pltr · 2026-05-18T15:22:25Z

-        return result
+        executor = ExecutorFactory.get_or_create()
+        return executor.map(
+            lambda args: _open_manifest(*args),


Extracted so both DataScan.scan_plan_helper (kept for back-compat / inspect.py) and plan_files below can share the partition-summary / per-file evaluator pipeline.

smaheshwar-pltr · 2026-05-18T15:22:39Z

        yield from ancestors_of(to_snapshot, table_metadata)


+def ancestors_between_ids(


Mirrors Java's SnapshotUtil.ancestorsBetween. Differs from the existing ancestors_between (snapshot-based, inclusive-inclusive) above by taking IDs and being exclusive-inclusive, to match the incremental-scan validation pattern. Raises if to_snapshot_id_inclusive is missing from metadata, mirroring Java.
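A minimal sketch of the exclusive-inclusive walk, assuming snapshots are modelled as an id-to-parent mapping. `ancestor_ids_between` is a hypothetical name chosen to avoid implying the real function's signature, and it returns a list rather than yielding so that the missing-snapshot error is raised eagerly.

```python
from typing import Dict, List, Optional


def ancestor_ids_between(
    from_exclusive: Optional[int],
    to_inclusive: int,
    parents: Dict[int, Optional[int]],
) -> List[int]:
    """Return IDs from `to_inclusive` back to, but not including, `from_exclusive`."""
    if to_inclusive not in parents:
        raise ValueError(f"Cannot find snapshot: {to_inclusive}")
    result: List[int] = []
    current: Optional[int] = to_inclusive
    # Stop at the exclusive bound, at the root, or where lineage leaves metadata.
    while current is not None and current in parents and current != from_exclusive:
        result.append(current)
        current = parents[current]
    return result


parents = {1: None, 2: 1, 3: 2, 4: 3}

between = ancestor_ids_between(2, 4, parents)
whole_lineage = ancestor_ids_between(None, 4, parents)
```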

smaheshwar-pltr · 2026-05-18T15:22:40Z

+        yield from ancestors_of(to_snapshot, table_metadata)
+
+
+def is_parent_ancestor_of(snapshot_id: int, ancestor_parent_snapshot_id: int, table_metadata: TableMetadata) -> bool:


Mirrors Java's SnapshotUtil.isParentAncestorOf, including the Cannot find snapshot raise on missing snapshot (Java throws one hop down, via ancestorsOf(long, lookup)).

smaheshwar-pltr · 2026-05-18T15:23:14Z

+
+@pytest.mark.integration
+@pytest.mark.parametrize("catalog", [lf("session_catalog_hive"), lf("session_catalog")])
+def test_incremental_append_scan_metrics_pruning(catalog: Catalog) -> None:


Filters on a non-partition column (number), so the manifest and partition evaluators degenerate to ALWAYS_TRUE and it's the per-file metrics evaluator (column min/max/null stats) that must do all the pruning. Covers a layer of ManifestGroupPlanner that the existing DataScan integration coverage doesn't exercise end-to-end through a real scan.
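To make the pruning layer concrete, here is a toy version of a min/max check for a `number > literal` predicate. `FileStats` and `may_contain_greater_than` are hypothetical; PyIceberg's real metrics evaluator handles many more predicates as well as null and NaN counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    path: str
    lower: int  # column minimum recorded for the file
    upper: int  # column maximum recorded for the file


def may_contain_greater_than(stats: FileStats, literal: int) -> bool:
    """Return False only when the max proves no row can satisfy number > literal."""
    return stats.upper > literal


files: List[FileStats] = [
    FileStats("a.parquet", 1, 10),
    FileStats("b.parquet", 11, 20),
    FileStats("c.parquet", 21, 30),
]

# Neither the manifest nor the partition evaluator can help on a non-partition
# column, so only the per-file stats decide which files are read.
kept = [f.path for f in files if may_contain_greater_than(f, 20)]
```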

smaheshwar-pltr · 2026-05-18T20:15:31Z

-        Returns:
-            pa.Table: Materialized Arrow Table from the Iceberg table's DataScan
-        """
+    def count(self) -> int:


(This code is not new, just moved)

smaheshwar-pltr · 2026-05-18T20:16:55Z

-    @cached_property
-    def partition_filters(self) -> KeyDefaultDict[int, BooleanExpression]:
-        return KeyDefaultDict(self._build_partition_projection)
+def _to_arrow_via_file_scan_tasks(scan: BaseScan, tasks: Iterable[FileScanTask]) -> pa.Table:


Introducing this helper (and the one below), specialised for FileScanTask. We don't want this to be the default implementation on BaseScan because it requires FileScanTask specifically, and not all table scans will plan FileScanTask in general (e.g. changelog scans).

- Raise on missing snapshot in `is_parent_ancestor_of`. - Add empty-range integration test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Feature: Incremental Append Scan

1fd9274

smaheshwar-pltr force-pushed the sm/incremental-append-scan-v2 branch from f86284f to 1fd9274 Compare May 15, 2026 23:32


smaheshwar-pltr added 2 commits May 18, 2026 15:23

Nits

2f349ab

Nits

510b586


Nits

dd17340


Nits

b344ea5


smaheshwar-pltr marked this pull request as ready for review May 18, 2026 22:45


